Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


by one of the two-letter header record type codes. A record type code may be followed by fields tagged with two-letter subtype codes. Table 2.1 lists and describes the two-letter codes of the SAM header section, and Figure 2.14 shows an example header section. Notice that the SAM file begins with “@HD VN:1.0 SO:coordinate”, which indicates that version 1.0 of the SAM specification was used and that the alignments in the file are sorted by coordinate. There are also several @SQ header lines, one for each reference sequence used for the alignment. Each @SQ line includes the sequence name (SN), which here is the chromosome name, and the sequence length (LN). The last two lines in this header section are @PG, which describes the program used for the alignment, and @CO, a free-text comment line that is used here to describe the command lines.
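The header layout described above can be read with a few lines of Python. The following is a minimal sketch, not a replacement for a dedicated library: it assumes the header fields are tab-separated, as the SAM specification requires, and the function name `parse_sam_header` and the example header lines are hypothetical, made up for illustration.

```python
def parse_sam_header(lines):
    """Group SAM header lines by record type code (@HD, @SQ, ...)."""
    header = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("@"):
            break  # the header ends at the first alignment line
        code, _, rest = line.partition("\t")
        if code == "@CO":
            # Comment lines are free text, not TAG:VALUE pairs
            header.setdefault(code, []).append(rest)
            continue
        # Split each field on the first colon only, since values may contain ':'
        tags = dict(field.split(":", 1) for field in rest.split("\t") if field)
        header.setdefault(code, []).append(tags)
    return header


# Hypothetical header lines, made up for illustration
example = [
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@SQ\tSN:chr2\tLN:242193529",
    "@PG\tID:aligner\tPN:aligner",
    "@CO\tany free-text comment",
]
header = parse_sam_header(example)
print(header["@HD"][0]["SO"])
print([(sq["SN"], int(sq["LN"])) for sq in header["@SQ"]])
```

Each record type maps to a list because several types (@SQ, @RG, @PG, @CO) may occur more than once in a header.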

The alignment section begins after the header section. Each alignment line has 11 mandatory fields that store the essential alignment information, followed by a variable number of optional fields that provide additional, aligner-specific information.

Figure 2.15 shows a partial alignment section of a SAM file. The columns of the alignment section are split because they do not fit on the page. Table 2.2 lists and describes the 11 mandatory fields of the SAM alignment section.

These 11 mandatory fields are always present in every alignment line of a SAM file. If the information for any of these fields is not available, the field is set to “0” if its data type is integer or “*” if its data type is string. Most field names are self-explanatory.
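The placeholder rule can be illustrated with a short Python sketch that splits one alignment line into its mandatory and optional fields and reports which mandatory fields are unavailable. The field names are those of the SAM specification. Two points are assumptions of this sketch rather than statements from the text: "0" is treated as a placeholder only for POS, PNEXT, and TLEN, because 0 is a meaningful value for FLAG and MAPQ, and optional fields are assumed to follow the TAG:TYPE:VALUE layout of the specification. The function names and the example line are hypothetical.

```python
MANDATORY = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
             "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}
# Assumption: 0 means "unavailable" only for these integer fields
ZERO_MEANS_MISSING = {"POS", "PNEXT", "TLEN"}


def parse_alignment(line):
    """Split one alignment line into mandatory and optional fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(MANDATORY):
        raise ValueError("expected at least 11 tab-separated fields")
    record = {}
    for name, value in zip(MANDATORY, columns):
        record[name] = int(value) if name in INTEGER_FIELDS else value
    # Optional fields have the form TAG:TYPE:VALUE; the value may contain ':'
    optional = {}
    for field in columns[len(MANDATORY):]:
        tag, value_type, value = field.split(":", 2)
        optional[tag] = (value_type, value)
    record["optional"] = optional
    return record


def unavailable_fields(record):
    """Return the mandatory fields that hold a placeholder value."""
    missing = []
    for name in MANDATORY:
        value = record[name]
        if name in ZERO_MEANS_MISSING and value == 0:
            missing.append(name)
        elif name not in INTEGER_FIELDS and value == "*":
            missing.append(name)
    return missing


# Hypothetical alignment line, made up for illustration
line = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\t*\tNM:i:0"
record = parse_alignment(line)
print(unavailable_fields(record))
print(record["optional"])
```

In this example the mate-related fields (RNEXT, PNEXT, TLEN) and the base qualities (QUAL) carry placeholders, which is what a single-end read without stored qualities would look like.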

TABLE 2.1 The Two-Letter Codes of the Header Section and Their Descriptions

Code    Header Code Description

@HD     File-level metadata. If present, it must be the first line of the SAM file. This header line may include the subtypes VN for the format version, SO for the sorting order, GO for the grouping of alignments, and SS for the sub-sorting order of alignments.

@SQ     A reference sequence used for aligning the reads. A SAM file may include multiple @SQ lines, one for each reference sequence used. The order of the @SQ lines defines the order of alignment sorting. The two most common subtype codes in this header line are SN for the reference sequence name and LN for the reference sequence length.

@RG     A read group, which some downstream analysis programs (e.g., GATK) use for grouping reads based on the study design. Multiple @RG lines can exist in a SAM file. This line may include ID for the unique read group identifier, BC for the barcode sequence identifying the sample, CN for the name of the sequencing facility, DS for a description of the read group, DT for the date of sequencing, LB for the sequencing library, PG for the programs used for processing the read group, PL for the platform of the sequencing technology used to generate the reads, PM for the platform model, PU for the platform unit, which is a unique identifier (e.g., flow cell/slide barcode), and SM for the sample identifier, which is the pool name where a pool is being sequenced.

@PG     The program used to align the reads. This line may include ID for the unique program record identifier, PN for the program name, CL for the command line used to run the program, PP for the ID of the previous @PG record, DS for a description, and VN for the program version.

@CO     A free-text comment. Multiple @CO lines are allowed.
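Some of the rules in Table 2.1 can be checked mechanically. The sketch below, with the hypothetical function name `check_header`, tests three of them: @HD must be the first line if present, each @RG identifier must be unique, and each @SQ line carries both SN and LN. The last check draws on the SAM specification, which requires both tags, whereas the table only calls them the most common ones. Tab-separated fields are assumed, and the example lines are made up for illustration.

```python
def check_header(lines):
    """Return a list of problems found in SAM header lines."""
    problems = []
    seen_rg_ids = set()
    for number, line in enumerate(lines, start=1):
        code, _, rest = line.rstrip("\n").partition("\t")
        if code == "@CO":
            continue  # free-text comments have no tags to check
        tags = dict(f.split(":", 1) for f in rest.split("\t") if f)
        if code == "@HD" and number != 1:
            problems.append(f"line {number}: @HD is not the first line")
        elif code == "@SQ" and not {"SN", "LN"} <= tags.keys():
            problems.append(f"line {number}: @SQ lacks SN or LN")
        elif code == "@RG":
            rg_id = tags.get("ID")
            if rg_id is None:
                problems.append(f"line {number}: @RG lacks ID")
            elif rg_id in seen_rg_ids:
                problems.append(f"line {number}: duplicate @RG ID {rg_id}")
            else:
                seen_rg_ids.add(rg_id)
    return problems


# Hypothetical header with two deliberate mistakes
bad_header = [
    "@SQ\tSN:chr1\tLN:1000",
    "@HD\tVN:1.0\tSO:coordinate",
    "@RG\tID:rg1\tSM:sampleA",
    "@RG\tID:rg1\tSM:sampleB",
]
problems = check_header(bad_header)
for problem in problems:
    print(problem)
```

Collecting the problems in a list, rather than raising on the first one, lets a single pass report everything that is wrong with a header.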